Text mining online data with scikit-learn

Text mining has a wide variety of applications and is being used by more and more businesses to gather intelligence and provide insight. People send text online constantly via social media, chat rooms and blogs. Tapping into this information can give businesses an advantage, and it is increasingly a necessary skill for data analytics. Text mining is a distinctive data mining problem: it deals with real-world data that is often heavy on artefacts, difficult to model and challenging to manage properly. It can seem like a dark art that is hard to learn and hard to gain traction in. However, a few basic strategies often produce good results quite quickly, and the same basic models appear in many text mining challenges.

The scikit-learn project is a library of machine learning algorithms for the scientific python stack (numpy & scipy). It is known for its detailed documentation, high code quality and a growing list of users worldwide. The documentation includes tutorials for learning machine learning as well as the library itself, and is a great place to start for beginners wanting to learn data analytics. There is a strong focus on reusable components and useful algorithms, and the text mining sections of scikit-learn follow the "standard model" of text mining quite closely.

In this presentation, we will go through the scikit-learn project for machine learning and show how to use it for text mining applications. Real world data and applications will be used, including spam detection on Twitter, predicting the author of a program and determining a user's political bent based on their social media account.

Who am I

Robert Layton

  • Research Fellow at Federation University Australia
  • Data analyst, lots of text
  • scikit-learn contributor (including GSoC mentor)

In [1]:
import sys
print("Python: {}".format(sys.version))


Python: 3.4.0 (default, Apr 11 2014, 13:05:11) 
[GCC 4.8.2]

In [2]:
import numpy as np
import sklearn
print("scikit-learn: {}".format(sklearn.__version__))


scikit-learn: 0.15.0

What this talk is about

Using scikit-learn for text mining.

  • I'll go through a standard categorisation example
  • Then do spam detection on Twitter

Then I'll quickly talk about some other options and how they would integrate into scikit-learn's framework.

What this talk is not!

This isn't a talk about:

  • NLTK -- A great library, but I'm trying to be focused
  • Pandas -- Again, great, but out of scope
  • Obtaining online data -- maybe another year!
  • Ethics -- for this talk, just assume we got consent

In [3]:
# Let's get some data
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train', categories=['alt.atheism', 'soc.religion.christian', 'talk.politics.guns'])

from pprint import pprint
pprint(list(newsgroups_train.target_names))


['alt.atheism', 'soc.religion.christian', 'talk.politics.guns']

In [4]:
# Let's split the dataset into two sets: training and testing
from sklearn.cross_validation import train_test_split
docs_train, docs_test, y_train, y_test = train_test_split(newsgroups_train.data, newsgroups_train.target)
print("Number of training documents: {}".format(len(docs_train)))
print("Number of testing documents: {}".format(len(docs_test)))


Number of training documents: 1218
Number of testing documents: 407

In [5]:
# First: how are we going to evaluate?
# F-score -- related to accuracy, based on precision and recall
# Hard to "fake" for unbalanced datasets
from sklearn.metrics import f1_score
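
For reference, the F1 score is the harmonic mean of precision and recall. A quick sketch of the calculation, with made-up precision and recall values:

# Not from the dataset -- these precision/recall values are made up for illustration.
precision, recall = 0.8, 0.5
f1 = 2 * precision * recall / (precision + recall)
print("Example F1: {:.3f}".format(f1))  # the harmonic mean punishes a low precision or recall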

A basic example

Machine learning algorithms don't typically take text as input.

Instead, we need to convert the text into a vector:

$x_i = [x_{i,0}, x_{i,1}, x_{i,2}, ... ]$

We can do that easily by simply choosing a list of words and counting how frequently they occur:

$x_{i, j}$ is the frequency of word $j$ in document $i$.

We can choose our list of words manually, or we can set it from the data. Usually we set it from the data, for example by dropping the least frequently occurring words. This is called the "bag of words" model, and scikit-learn has it built in.
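
A toy illustration of the idea first, on two made-up documents (not the newsgroups data):

from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["the cat sat on the mat", "the dog chased the cat"]  # made-up documents
toy_model = CountVectorizer()
toy_X = toy_model.fit_transform(toy_docs)
print(sorted(toy_model.vocabulary_))  # the words that were found
print(toy_X.toarray())  # one row per document, one column per word, values are counts

The same vectoriser, fitted on the real training documents: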


In [6]:
from sklearn.feature_extraction.text import CountVectorizer

# Fit on training data
model = CountVectorizer()
X_train = model.fit_transform(docs_train)

# The vocabulary is the set of words used; it is a dict, available in model.vocabulary_
# It maps each word to its column index
pprint(list(model.vocabulary_.items())[:10])


[('outgrowth', 15427),
 ('lpf', 13270),
 ('genocide', 9865),
 ('san', 18510),
 ('renouncing', 17802),
 ('mothers', 14301),
 ('swimmer', 20446),
 ('fifth', 9195),
 ('buyer', 4573),
 ('manhattan', 13482)]

In [7]:
# X_train is our bag of words matrix: X_train[i, j] is the count of the word with index j in the document with index i
# It is a sparse matrix, which we will get to later on
print(type(X_train))


<class 'scipy.sparse.csr.csr_matrix'>
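
Most documents use only a tiny fraction of the full vocabulary, which is why the matrix is stored sparsely. A quick sketch to check just how sparse it is:

# Fraction of entries in X_train that are non-zero -- usually a very small number for text.
density = X_train.nnz / float(X_train.shape[0] * X_train.shape[1])
print("Non-zero entries: {:.4%}".format(density))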

In [8]:
keyword = "believe"
documents_containing_keyword = [index for index in range(len(docs_train)) if keyword in docs_train[index]]
assert len(documents_containing_keyword) > 0
keyword_index = model.vocabulary_[keyword]
document_index = documents_containing_keyword[0]
print(keyword_index in X_train[document_index].nonzero()[1])
print("The keyword {} appears {} times in document {}".format(keyword, X_train[document_index,keyword_index], document_index))


True
The keyword believe appears 1 times in document 12

In [9]:
# Let's compare some words, and how they differ between categories
words = ["believe", "right", "bible"]
word_indices = [model.vocabulary_[word] for word in words]
classes = sorted(set(y_train))
categories = newsgroups_train.target_names
#print(X_train[y_train == 0, word_indices[0]])
frequency = np.array([[X_train[y_train == category,wi].mean() for wi in word_indices]
                       for category in classes]).T

assert frequency.shape == (len(word_indices), len(classes)), frequency.shape
print(frequency)


[[ 0.53351955  0.702407    0.37717122]
 [ 0.32960894  0.26039387  0.56823821]
 [ 0.34078212  0.64989059  0.00496278]]

In [10]:
%matplotlib inline
# Setup the plot
from matplotlib import pyplot as plt
ind = np.arange(frequency.shape[0])
width = 0.2

colors = "rgbyck"

In [11]:
fig = plt.figure(figsize=(20, 8))
ax = fig.add_subplot(111)
for column in range(frequency.shape[1]):
    ax.barh(ind + (width * column), frequency[:,column], width, color=colors[column], label=categories[column])
ax.set(yticks=ind + width, yticklabels=words, ylim=[len(words)*width - 1, frequency.shape[0]])
ax.legend(bbox_to_anchor=(0.9, 0.8))
r = plt.xlim((0, 1))
plt.show()


[Output: horizontal bar chart of the mean frequency of 'believe', 'right' and 'bible' in each of the three categories]

In [12]:
# We can use our existing model to transform the test documents in the same way
# Because we don't fit again, the indices match the previous vocabulary
X_test = model.transform(docs_test)

In [13]:
# Then we build a basic classifier and test it out
from sklearn.svm import SVC
clf = SVC().fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("F1-score: {:.3f}".format(f1_score(y_test, y_pred)))
print("Accuracy: {:.3f}".format(np.mean(y_test == y_pred)))


F1-score: 0.527
Accuracy: 0.597
/usr/lib/python3/dist-packages/scipy/sparse/compressed.py:119: UserWarning: indptr array has non-integer dtype (float64)
  % self.indptr.dtype.name)

In [14]:
# Let's put that all into a short snippet:
text_model = CountVectorizer()
clf_model = SVC()

# Convert documents to vectors
X_train = text_model.fit_transform(docs_train)
X_test = text_model.transform(docs_test)

# Train classifier
clf_model.fit(X_train, y_train)
y_pred = clf_model.predict(X_test)

# Evaluate
print("F1-score: {:.3f}".format(f1_score(y_test, y_pred)))
print("Accuracy: {:.3f}".format(np.mean(y_test == y_pred)))
# The results will change, as there is some randomness.
# We can usually address that using random_state, but that is out of scope for today.


F1-score: 0.527
Accuracy: 0.597

Spam Detection on Twitter

Spam on social media is a big problem -- it ruins the user experience, wastes resources for companies and is just generally annoying. It can also propagate crime, by allowing criminals to advertise their goods without having to go through other channels. Luckily, the same techniques as above can be applied here.


In [15]:
# Get a dataset of spam and non-spam twitter posts
# Offline, I collected a bunch of twitter posts and manually sorted through a few thousand.
# This is a pretty easy job for a human, but can be hard for machines!
import numpy as np

def load_twitter_data(filename='tweets_spam.csv'):
    documents = []
    classes = []
    with open(filename) as inf:
        for line in inf:
            data = line.split(",")
            documents.append(",".join(data[:-1]))
            classes.append(int(data[-1]))
    classes = np.array(classes, dtype='int')
    return documents, classes

In [16]:
documents, classes = load_twitter_data()
print("Loaded {} documents, {} are spam and {} are not".format(len(documents), sum(classes == 1), sum(classes == 0)))
docs_train, docs_test, y_train, y_test = train_test_split(documents, classes)
print("Number of training documents: {}".format(len(docs_train)))
print("Number of testing documents: {}".format(len(docs_test)))


Loaded 502 documents, 104 are spam and 398 are not
Number of training documents: 376
Number of testing documents: 126

In [17]:
# Let's use our previous model and see how that goes...

text_model = CountVectorizer()
clf_model = SVC()

# Convert documents to vectors
X_train = text_model.fit_transform(docs_train)
X_test = text_model.transform(docs_test)

# Train classifier
clf_model.fit(X_train, y_train)
y_pred = clf_model.predict(X_test)

# Evaluate
print("F1-score: {:.3f}".format(f1_score(y_test, y_pred)))
print("Accuracy: {:.3f}".format(np.mean(y_test == y_pred)))


F1-score: 0.000
Accuracy: 0.786
/usr/local/lib/python3.4/dist-packages/sklearn/metrics/metrics.py:1771: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no predicted samples.
  'precision', 'predicted', average, warn_for)

In [18]:
# That's odd -- high accuracy, (very) low f-score
# A per-class breakdown (classification report) will show us the issue here
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))


             precision    recall  f1-score   support

          0       0.79      1.00      0.88        99
          1       0.00      0.00      0.00        27

avg / total       0.62      0.79      0.69       126

/usr/local/lib/python3.4/dist-packages/sklearn/metrics/metrics.py:1771: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)

In other words, the classifier is "cheating" by labelling every entry as not spam. This gives high accuracy, but only because most of our dataset is not spam. Instead, let's try a range of different parameters.

We can build a pipeline to handle this for us, which lets us search over those parameters systematically.


In [19]:
from sklearn.pipeline import Pipeline
text_model = CountVectorizer()
clf_model = SVC()
pipeline = Pipeline([('vectorizer', text_model),
                     ('classifier', clf_model)
                      ])
# With the pipeline defined, we set our parameters in a dictionary, which we can specify as ranges
params = {
            # Vectoriser parameters
            'vectorizer__ngram_range': [(1,3), ],  # n-grams are subsequences of "tokens"
            'vectorizer__analyzer': ['word',],  # words are our tokens
            'vectorizer__min_df': [2, 3],  # n-grams need to appear in at least this many documents in the dataset
            # Classifier parameters
            'classifier__C': [0.1, 1.0, 10, ],  # See Support Vector Machines information
            'classifier__kernel': ['rbf', 'linear'],
           }

But... how do we choose our parameters? We can't tune them on our existing test set, because we would overfit to it. Overfitting is what happens when a model is tuned so closely to one dataset that it no longer generalises to new data.

Typically this is handled by performing cross-fold validation, which scikit-learn supports for pipelines.
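
As a smaller sketch of the idea, cross_val_score runs this split/train/score loop for a single parameter setting (here, the pipeline's defaults):

from sklearn.cross_validation import cross_val_score

# Three stratified folds over the training data only -- the test set stays untouched.
scores = cross_val_score(pipeline, docs_train, y_train, scoring='f1', cv=3)
print("Mean F1 across folds: {:.3f}".format(scores.mean()))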


In [20]:
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import StratifiedKFold
grid = GridSearchCV(
                   pipeline,  # Our pipeline from above
                   params,  # The parameters we set above
                   refit=True,  # Ignore me for now
                   n_jobs=1,  # We can set this to the number of cores to use, or -1 for "all"
                   scoring='f1',  # f1 score
                   cv=StratifiedKFold(y_train, n_folds=3),  # What type of cross fold validation to use
                )

In [21]:
# Now, we fit our model
grid.fit(docs_train, y_train)

# and then evaluate!
y_pred = grid.predict(docs_test)

# Evaluate
print("F1-score: {:.3f}".format(f1_score(y_test, y_pred)))
print("Accuracy: {:.3f}".format(np.mean(y_test == y_pred)))
print(classification_report(y_test, y_pred))


F1-score: 0.681
Accuracy: 0.881
             precision    recall  f1-score   support

          0       0.90      0.96      0.93        99
          1       0.80      0.59      0.68        27

avg / total       0.88      0.88      0.87       126

By searching over a range of parameters, we can use the training data to find a model that better represents our data. The above report can be roughly summarised as:

  • When we predicted something as spam, we were right 80% of the time
  • Of the things that are spam, we found 59% of them

So not bad!
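
If you want the raw counts behind those percentages, a confusion matrix shows them directly (a quick sketch using the predictions from above):

from sklearn.metrics import confusion_matrix

# Rows are the true classes (0 = not spam, 1 = spam), columns are the predicted classes.
print(confusion_matrix(y_test, y_pred))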

From here, you need to define what your evaluation really means. If you are building a spam detector for Twitter, you may not care much about occasionally losing legitimate content -- in that case, you can use an evaluation measure that finds more spam, at the cost of more "false positives". If you are building one for your work email, you probably don't want to miss that email from your boss. In that case, you would penalise "false positives" more heavily, leading to more spam in your inbox but fewer legitimate emails missed.
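
One way to encode that trade-off is to weight recall and precision differently when scoring. A sketch using fbeta_score, where beta > 1 favours recall (catch more spam) and beta < 1 favours precision (protect legitimate messages):

from sklearn.metrics import fbeta_score

# beta=2 rewards catching more spam even at the cost of false positives;
# beta=0.5 does the opposite. Either can also be plugged into GridSearchCV
# as a custom scorer via sklearn.metrics.make_scorer.
print("F2-score:   {:.3f}".format(fbeta_score(y_test, y_pred, beta=2)))
print("F0.5-score: {:.3f}".format(fbeta_score(y_test, y_pred, beta=0.5)))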

One of the key parameters here was the n-gram range, which allows us to use several words in a row as a single feature. Twitter spam often uses "normal" words, so looking at single words alone is unlikely to give a good result -- individually, they occur in normal and spam tweets with approximately the same frequency.
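
To get a feel for what the vectoriser extracts, you can call its analyzer directly on a made-up tweet (a sketch):

# build_analyzer() returns the function the vectoriser uses to turn text into tokens.
analyzer = CountVectorizer(analyzer='word', ngram_range=(1, 3)).build_analyzer()
print(analyzer("win a free phone now"))  # unigrams, bigrams and trigrams of a made-up tweet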

Other possible applications

Authorship analysis is a growing field that aims to recognise the author of a document using only the content of that document. Typically this is performed on natural language documents (I covered this briefly last year).

However, it works for software too! Using the same types of features, you can often predict which of a set of authors wrote a given program.

We can do this by looking at character n-grams instead of word n-grams.
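
The same trick as before shows what character n-grams look like, here on a made-up line of code (a sketch):

# Character n-grams pick up punctuation and spacing habits, which helps for source code.
char_analyzer = CountVectorizer(analyzer='char', ngram_range=(2, 3)).build_analyzer()
print(char_analyzer("for i in range(10):"))  # made-up line of code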


In [22]:
# We can use our previous pipeline, but will use characters instead of words as our tokens
# With the pipeline defined, we set our parameters in a dictionary, which we can specify as ranges
params = {
            # Vectoriser parameters
            'vectorizer__ngram_range': [(1,3), ],  # n-grams are subsequences of "tokens"
            'vectorizer__analyzer': ['char',],  # characters are our tokens
            'vectorizer__min_df': [2, 3],  # n-grams need to appear in at least this many documents in the dataset
            # Classifier parameters
            'classifier__C': [0.1, 1.0, 10, ],  # See Support Vector Machines information
            'classifier__kernel': ['rbf', 'linear'],
           }

Other applications from research include applying LIWC to predict a social media user's political stance.

See: Tumasjan, A., Sprenger, T. O., Sandner, P. G., & Welpe, I. M. (2010). Predicting Elections with Twitter: What 140 Characters Reveal about Political Sentiment. ICWSM, 10, 178-185.

LIWC (Linguistic Inquiry and Word Count) enhances word vectors with information about the contexts in which they are used -- a bit like semantics on steroids.

Thanks!

You can reach me at robertlayton@gmail.com